
[serve][1/n] Introduce gang scheduling #60802

Open

jeffreywang-anyscale wants to merge 7 commits into ray-project:master from jeffreywang-anyscale:gang-scheduling-v2

Conversation

@jeffreywang-anyscale (Contributor) commented Feb 6, 2026

Description

This PR introduces gang scheduling support to Ray Serve, enabling atomic scheduling of replica groups. Gang scheduling ensures that a set of replicas is scheduled together -- either all succeed or all fail -- which is critical for distributed serving patterns that require tight coordination between replicas or across multiple deployments. This is a stepping stone toward data-parallel (DP) group fault tolerance in Ray Serve LLM.

Key decisions

  • Reserve replica resources for a gang atomically with a placement group (PG), and use the reserved PG to spin up the replica actors (see the sketch below).
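
How the atomic reservation could look with Ray's public placement group API -- a minimal sketch of the idea under illustrative assumptions (bundle shapes, resource amounts, and the PACK strategy are made up here), not the actual Serve internals:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

gang_size = 4
# One bundle per replica in the gang; the PG is reserved all-or-nothing.
pg = placement_group([{"CPU": 1}] * gang_size, strategy="PACK")
ray.get(pg.ready())  # resolves only once every bundle is reserved

@ray.remote(num_cpus=1)
class Replica:
    def ping(self):
        return "ok"

# Pin each replica actor to its own bundle in the reserved gang PG.
replicas = [
    Replica.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote()
    for i in range(gang_size)
]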

Out of scope for this PR

  • RESTART_REPLICA runtime failure policy
  • Autoscaling: Scale up and down with gang quantization
  • Inter-gang placement, e.g. SPREAD across gangs and PACK within a gang
  • Node / Label affinity placement
  • Metrics

Test approach

Basic validation

  • Basic gang deployment with @serve.deployment succeeds and responds to requests
  • Basic gang deployment with .options succeeds and responds to requests
  • Gang deployment with insufficient cluster resources still serves requests from the gangs that were successfully scheduled
  • No partial gangs: every gang has exactly gang_size replicas (see the sketch just below)
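
A rough sketch of the "no partial gangs" assertion against the public serve.status() API, assuming it continues to report per-state replica counts via replica_states; the application name "default", deployment name "ShardedModel", and gang size are illustrative:

from ray import serve

GANG_SIZE = 4  # illustrative; must divide num_replicas

# serve.status() reports per-deployment replica counts grouped by state.
overview = serve.status().applications["default"].deployments["ShardedModel"]
total_replicas = sum(overview.replica_states.values())

# No partial gangs: every scheduled gang is complete, so the replica count
# is always a whole multiple of the gang size.
assert total_replicas % GANG_SIZE == 0, (
    f"expected a multiple of {GANG_SIZE}, got {total_replicas}"
)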

Failure validation

  • Gang deployment with invalid configuration
  • Gang deployment with failed scheduling
  • Gang deployment with failed actors
  • Timeout behavior should be the same as standard replica scheduling

Placement strategy

  • Intra-gang placement strategy (PACK and SPREAD)
  • Inter-gang placement strategy

Failover / Fault Tolerance

  • RESTART_GANG runtime failure policy tears down the entire gang, and the following reconciliation loop brings a new gang back up. In the meantime, the gang deployment keeps serving traffic with no downtime.

Related issues

RFC: https://docs.google.com/document/d/1IzLTRJo-B8YF74eAKJA6XuDqZtzGOzEtJ1WjkMzroq4/edit?pli=1&tab=t.0

Additional information

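A hypothetical usage sketch pieced together from the names this PR introduces (gang_scheduling_config, gang_size, gang_placement_strategy); the exact decorator surface and the dict shape of the config are assumptions for illustration, not the finalized API:

from ray import serve

@serve.deployment(
    num_replicas=8,  # must be a whole multiple of gang_size
    ray_actor_options={"num_gpus": 1},
    # Assumed config shape; see the DeploymentConfig changes in this PR.
    gang_scheduling_config={
        "gang_size": 4,
        "gang_placement_strategy": "PACK",  # intra-gang placement
    },
)
class ShardedModel:
    async def __call__(self, request) -> str:
        return "ok"

serve.run(ShardedModel.bind())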


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces gang scheduling for Ray Serve, a significant feature for deploying distributed workloads. The changes are extensive, touching core components like the deployment scheduler and state management, adding new configuration options, and providing a solid suite of tests. My review focuses on a few areas for improvement: enhancing type safety in data classes, improving constant management for better code clarity, making cleanup logic more robust by adding logging, and simplifying some conditional logic. Overall, this is a well-implemented feature.

Comment on lines 1086 to 1087
except Exception:
    pass


medium

Catching a broad Exception and passing silently can hide important issues during cleanup. While cleanup should be robust, it's better to at least log the exception to aid in debugging potential problems. For example, if there's a permission issue or a problem with the GCS connection, we would want to know about it.

Suggested change
- except Exception:
-     pass
+ except Exception as e:
+     logger.warning(f"Failed to remove placement group {pg.id}: {e}")

Comment on lines +4393 to +4395
gang_placement_strategy=gang_config.gang_placement_strategy.value
if hasattr(gang_config.gang_placement_strategy, "value")
else str(gang_config.gang_placement_strategy),


medium

The logic to get the string value of gang_placement_strategy can be simplified. Since GangPlacementStrategy inherits from str and Enum, you can just use str() to get the string value. This is cleaner and less defensive.

                gang_placement_strategy=str(gang_config.gang_placement_strategy),

@jeffreywang-anyscale (Contributor, Author)

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces gang scheduling for Ray Serve, a significant feature for deploying distributed workloads. The implementation is comprehensive, touching core components like the deployment scheduler and state manager. New dataclasses for gang context and requests are added, along with configuration options in DeploymentConfig and the @serve.deployment decorator. The logic for reserving placement groups for gangs and handling the RESTART_GANG failure policy seems well-thought-out. The changes are supported by a good set of unit and end-to-end tests. My review includes a few minor suggestions for improving robustness and code clarity.

Comment on lines +264 to +268
if num_replicas % v.gang_size != 0:
    raise ValueError(
        f"num_replicas ({num_replicas}) must be a multiple of "
        f"gang_size ({v.gang_size})."
    )


medium

The num_replicas value could potentially be None, which would cause a TypeError when the modulo operator is used. While pydantic's default value handling might prevent this, adding a check for num_replicas is not None would make this validator more robust against unexpected None values.

Suggested change
- if num_replicas % v.gang_size != 0:
-     raise ValueError(
-         f"num_replicas ({num_replicas}) must be a multiple of "
-         f"gang_size ({v.gang_size})."
-     )
+ if num_replicas is not None and num_replicas % v.gang_size != 0:
+     raise ValueError(
+         f"num_replicas ({num_replicas}) must be a multiple of "
+         f"gang_size ({v.gang_size})."
+     )

Comment on lines +1024 to +1025
f"num_replicas_to_add {request.num_replicas_to_add} "
f"is not divisible by gang_size {gang_size}. "


medium

There's a trailing space in the f-string for the error message, which should be removed for cleaner output.

Suggested change
- f"num_replicas_to_add {request.num_replicas_to_add} "
- f"is not divisible by gang_size {gang_size}. "
+ f"num_replicas_to_add {request.num_replicas_to_add} "
+ f"is not divisible by gang_size {gang_size}."

@jeffreywang-anyscale (Contributor, Author)

@cursor review


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

f"gang_size ({v.gang_size})."
)

return v


Validator crashes when num_replicas is None

High Severity

The validate_gang_scheduling_config validator performs num_replicas % v.gang_size without checking if num_replicas is None. The num_replicas field is Optional[NonNegativeInt] and can be None when autoscaling is used (e.g. num_replicas="auto"). Additionally, in Pydantic v1, if num_replicas fails its own validation, it won't be present in values, causing values.get("num_replicas") to return None. In either case, None % v.gang_size raises a TypeError.


int32 max_constructor_retry_count = 20;

// Gang scheduling configuration for atomic replica scheduling.
GangSchedulingConfig gang_scheduling_config = 21;


Proto file modified requires fault-tolerance review notice

Low Severity

This PR modifies src/ray/protobuf/serve.proto. Per the "RPC Fault Tolerance Standards Guide" rule:

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst


# Gang PGs are shared across multiple replicas.
# Another replica in the same gang may have already
# removed this PG.
pass


Gang PG silently removed while siblings still running

Medium Severity

When a gang replica finishes stopping, check_stopped removes the shared placement group. Since gang PGs are shared, the first replica to stop removes the PG while sibling replicas may still be running on it. This is especially problematic during downscaling (which has no gang awareness) — individual replicas can be selected for removal, causing the shared PG to be deleted from under still-active gang members. The broad except Exception: pass also silently swallows unrelated errors during PG cleanup.


@jeffreywang-anyscale changed the title from "[serve][wip] Introduce gang scheduling" to "[serve][1/n] Introduce gang scheduling" on Feb 7, 2026
@jeffreywang-anyscale (Contributor, Author)

Ready for a first pass while I keep adding tests.

@jeffreywang-anyscale jeffreywang-anyscale marked this pull request as ready for review February 7, 2026 07:10
@jeffreywang-anyscale jeffreywang-anyscale requested a review from a team as a code owner February 7, 2026 07:10
@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 7, 2026
